Improving Data Cleaning Quality Using a Data Lineage Facility
Authors
Abstract
Data cleaning, which consists of removing inconsistencies and errors from original data sets, is a well-known problem in decision support systems and data warehouses. However, for some applications, existing ETL (Extraction, Transformation, Loading) and data cleaning tools for writing data cleaning programs are insufficient. One important challenge is designing a data flow graph that effectively generates clean data. A more general difficulty is the lack of explanations of cleaning results and of user interaction facilities for tuning a data cleaning program. This paper presents a solution to this problem that enables users to express user interactions declaratively and to tune data cleaning programs.
Similar resources
NEW TECHNIQUES FOR IMPROVING BIOLOGICAL DATA QUALITY THROUGH INFORMATION INTEGRATION
By Katherine Grace Herbert. As databases become more pervasive through the biological sciences, various data quality concerns are emerging. Biological databases tend to develop data quality issues regarding data legacy, data uniformity and data duplication. Due to the nature of this data, each of these problems ...
A revival of integrity constraints for data cleaning
Integrity constraints, a.k.a. data dependencies, are widely used for improving the quality of schema. Recently, constraints have enjoyed a revival for improving the quality of data. This tutorial aims to provide an overview of recent advances in constraint-based data cleaning.
Biological data cleaning: a case study
As databases become more pervasive through the biological sciences, various data quality concerns are emerging. Biological databases tend to develop data quality issues regarding data legacy, data uniformity and data duplication. Due to the nature of this data, each of these problems is non-trivial and can cause many problems for the database. For biological data to be corrected and standardise...
Improving Data Quality in Intelligent Transportation Systems
Intelligent Transportation Systems (ITS) use data and information technology to improve the operation of our transportation network. ITS contributes to sustainable development by using technology to make the transportation system more efficient; improving our environment by reducing emissions, reducing the need for new construction and improving our daily lives through reduced congestion. A ke...
Data Quality in Data Warehouses
Fayyad and Uthurusamy (2002) have stated that the majority of the work (representing months or years) in creating a data warehouse is in cleaning up duplicates and resolving other anomalies. This article provides an overview of two methods for improving quality. The first is data cleaning for finding duplicates within files or across files. The second is edit/imputation for maintaining business ...